

Locally Hierarchical Auto-Regressive Modeling for Image Generation: Supplementary Document. A Implementation Details. A.1 HQ-VAE

Neural Information Processing Systems

The input of the main transformer starts with the start-of-sentence (SOS) token. Our implementation is based on PyTorch 1.10; details of HQ-TVAE are given in Table B. When learning the resizing operations, we apply two different loss functions; Figure C shows examples of images reconstructed by HQ-VAE with the learnable down- and up-sampling layers. We set the number of self-attention blocks in IET to 1 or 2. We propose locally hierarchical decoding in PHT, in contrast to the standard sequential approach, by assuming conditional independence among the bottom codes given a top code. The ablation study in Table C(b) demonstrates the benefit of our decoding strategy in the PHT with respect to image generation quality. We use the smallest model, HQ-Transformer (S), to verify the architectural choices. [Table C(a) baseline configuration: input embedding = addition, decoding policy = locally hierarchical conditioning, label type = one-hot, (top-k, t) = (2048, 0.9); FID 11.03, Precision 0.70, Recall 0.55.] B.4 Soft-Labeling in HQ-Transformer. Table C(c) shows that soft-labeling improves FID compared to one-hot labeling.
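The locally hierarchical decoding idea can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the linear head, and all shapes (4 bottom positions per local pyramid, a small vocabulary) are assumptions made for the example. Because the bottom codes are treated as conditionally independent given the top code, all of them are sampled in one parallel step rather than one-by-one auto-regressively:

```python
import numpy as np

def sample_bottom_codes(top_embed, weight, bias, vocab_size, rng):
    """Hypothetical sketch of locally hierarchical decoding: given the
    embedding of one top code, draw all bottom codes of its local pyramid
    in a single parallel step (assumed conditional independence)."""
    # Per-position logits conditioned only on the top code.
    # weight: (num_bottom, vocab_size, embed_dim), top_embed: (embed_dim,)
    logits = weight @ top_embed + bias            # (num_bottom, vocab_size)
    # Softmax over the code vocabulary, per bottom position.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # Independent categorical draws: one per bottom position, all at once.
    return np.array([rng.choice(vocab_size, p=p) for p in probs])

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8, 3))   # 4 bottom positions, vocab 8, embed dim 3
b = np.zeros((4, 8))
top = rng.normal(size=3)
codes = sample_bottom_codes(top, w, b, 8, rng)  # e.g. array of 4 code indices
```

A sequential decoder would instead condition each bottom code on the previously sampled ones, costing one forward pass per code; the parallel draw above is what makes the local pyramid cheap to decode.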


Locally Hierarchical Auto-Regressive Modeling for Image Generation

Neural Information Processing Systems

We propose a locally hierarchical auto-regressive model with multiple resolutions of discrete codes. In the first stage of our algorithm, we represent an image with a pyramid of codes using a Hierarchically Quantized Variational AutoEncoder (HQ-VAE), which disentangles the information contained in the multi-level codes. In the two-level case, for example, we create two separate pathways: top codes carry the high-level coarse structure of input images, while a residual connection for bottom codes compensates for the missing fine details. An appropriate selection of resizing operations for the code embedding maps enables the top codes to capture maximal information within images, and the first-stage algorithm achieves better performance on both vector quantization and image generation. The second stage adopts a Hierarchically Quantized Transformer (HQ-Transformer) to process a sequence of local pyramids, each of which consists of a single top code and its corresponding bottom codes.
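The two-pathway design above can be sketched numerically. The following NumPy example is a simplified stand-in, not the paper's HQ-VAE: average pooling replaces the learnable down-sampling, nearest-neighbor repetition replaces the learnable up-sampling, and all shapes and function names are assumptions. Top codes quantize the pooled (coarse) features, and bottom codes quantize the residual that the top pathway leaves behind:

```python
import numpy as np

def nearest_code(vectors, codebook):
    """Assign each row of `vectors` to its nearest codebook entry (L2)."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def two_level_quantize(features, top_codebook, bottom_codebook):
    """Simplified sketch of two-level quantization with a residual pathway.
    `features`: (N, D) encoder outputs, N assumed even for the toy pooling."""
    n, d = features.shape
    # Coarse pathway: pool pairs of feature vectors (stand-in for down-sampling).
    pooled = features.reshape(n // 2, 2, d).mean(axis=1)
    top_idx, top_q = nearest_code(pooled, top_codebook)
    # Up-sample the top quantization back to full resolution.
    top_up = np.repeat(top_q, 2, axis=0)
    # Residual pathway: bottom codes compensate for the missing fine detail.
    residual = features - top_up
    bot_idx, bot_q = nearest_code(residual, bottom_codebook)
    # The decoder input sums both pathways (the residual connection).
    return top_idx, bot_idx, top_up + bot_q

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
top_idx, bot_idx, recon = two_level_quantize(
    feats, rng.normal(size=(16, 4)), rng.normal(size=(16, 4)))
```

In this toy setup there are half as many top codes as bottom codes, mirroring how a coarser grid of top codes summarizes a finer grid of bottom codes in a local pyramid.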